Sampling for queries

ABSTRACT

Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled in one of many known ways to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data. Further methods involve the use of weighted sampling and weighted selection of outlier values for low selectivity queries, or queries having group by

FIELD OF THE INVENTION

[0001] This invention relates generally to the field of computers, andin particular to improving sampling techniques for more accurateestimations of queries of large databases.

COPYRIGHT NOTICE/PERMISSION

[0002] A portion of the disclosure of this patent document containsmaterial that is subject to copyright protection. The copyright ownerhas no objection to the facsimile reproduction by anyone of the patentdocument or the patent disclosure as it appears in the Patent andTrademark Office patent file or records, but otherwise reserves allcopyright rights whatsoever. The following notice applies to thesoftware and data as described below and in the drawing hereto:Copyright© 2000, Microsoft Corporation, All Rights Reserved.

BACKGROUND

[0003] Decision support applications such as On Line AnalyticalProcessing(OLAP) and data mining tools for analyzing large databases aregaining popularity. Many databases are growing larger and larger, makingit difficult for application programs and computers to search them intheir entirety for each request for data, or query, by a user. This isparticularly true when the number of users desiring data from the samedatabase is also increasing.

[0004] Executing such applications on large volumes of data can also beresource intensive. Fortunately though, samples of the results ofqueries to the database back end can be used to enhance the scalabilityof application servers without compromising the quality of theiranalysis. In other words, only portions or samples of the database areused to provide an estimate of results without scanning the entiredatabase.

[0005] Typical decision-support applications use a structured querylanguage (SQL) database system as the backend data warehouse andcommunicate data retrieval requests via relational SQL queries. On alarge database, the cost of executing such queries against therelational backend can be expensive. Therefore, the ability of theapplication servers to work with samples raises the question of whethera database server can compute a sample of answers to queries withoutpaying the cost of executing the entire query. Such functionality wouldalso enable the database servers to scale efficiently. No matter howmany records are in the database, only a statistically relevant sampleof them would need to be checked to provide a fairly accurateestimation.

[0006] Data mining and statistical analysis involve techniques that canbe robust and effective even when supplied with merely a sample of queryresults. Likewise, OLAP servers that answer queries involvingaggregation (performing some function on the results to arrive at ananswer, e.g., finding the average) can significantly benefit from theability to present to the user an approximate aggregate computed from asample of the subset of the multidimensional data the user is interestedin analyzing.

[0007] It is important to recognize that sampling must be supported onthe result of ad-hoc SQL queries, not just on stored relations. Thereare several reasons for such a requirement. First, applications need toanalyze ad-hoc fragments of data, driven by user requirements, e.g.,sales data for a certain time period or sales data for a certainproduct. Next, great scalability benefits may accrue from speeding-upqueries that are expensive to compute in their entirety (complex SQLqueries). Such queries typically involve foreign key joins, selectionsand group by.

[0008] Although there has been a wealth of work in databases onsupporting sampling, no previous work has carefully examined thepracticality of sampling techniques for ad-hoc queries over arbitrarydatabases. Approximate query answering has recently received a lot ofattention, particularly in the context of Data Warehousing. Since mostqueries in the warehousing application involve aggregation, there is aneed to ensure that approximately answering aggregation queries usingsampling provides good results.

[0009] Sampling performs poorly when the distribution of the aggregatedattribute is skewed. In other words, if some values are very high orvery low compared to the rest of the values, sampling of the data mightnot find those “outlier” values, which could greatly affect the resultobtained. Such a database is said to exhibit data skew. Sampling is alsohard to use effectively in queries with low selectivity.

[0010] The potential effect of outlier values is illustrated by lookingat a small database of six records. The values to be aggregated are 1,2, 1, 4, 3, and 1000. The sampling of these values is based on lookingat every other record. The values obtained from this data given thesampling used are 1, 1, and 3. If trying to estimate an average, it isclear that the estimation will be far from the real answer because thevalue of 1000 was not included in the results. Perhaps the numbersrepresent sales results from various locations. If trying to estimatetotal sales, the estimate will be far to low.

[0011] There are many other examples of how outlier values candramatically affect the results of sampling a database. One exampleinvolves 10,000 tuples of which 99% have value 1 in the aggregatecolumn, while the remaining 1% of the tuples are of value 1000. Considerusing a uniform random sample of size 100 to estimate the average valueof the column over all tuples. It is quite likely that the sample wouldnot include any tuple of value 1000, leading to an estimate of 1 for theaverage value. On the other hand, if perchance two or more tuples ofvalue 1000 were to be included in the sample, then the estimate of theaverage value would be more than 20.98. In either case, the estimatewould be far from the true value of the actual average, 10.99. It isonly in the case where exactly one tuple of value 1000 is in the samplethat a reasonable estimate of the average value would be obtained. Butthe latter event has probability 0.37, therefore there is highprobability that a large error in the estimate would result.

[0012] In this example it is observed that a skewed database ischaracterized by the existence of certain tuples that are deviant fromthe rest with respect to their aggregate value. These are referred to asoutliers. It should be noted that the presence of these tuples in thewrong proportion in the sample would also lead to large errors in theestimate of a query's result.

[0013] There are many examples one could provide to justify that indeedsuch skew exists in databases. Consider a database that maintains thepopulation for different countries. The population figures for countrieslike China and India are surely deviant from the figures for theremaining countries and lead to a skew. Suppose a database maintainednationwide sales figures for a range of items ranging from a particularluxury model of automobile to six-packs of soda pop. While the salesfigures for the automobile model would be very small, the sales figuresfor Coke could range in the millions. One potential solution to theproblem involves greatly increasing the sample size. Greatly increasingthe sample size results in losing the very benefit that samplingprovides. It requires more system resources and takes more time.

[0014] There is a need to properly account for such outlier valueswithout significantly affecting the speed and scalability benefitsprovided by sampling.

[0015] Low selectivity of queries can also contribute to significanterror in approximation of aggregate values. No single sample of thedatabase can answer all queries with low selectivity with sufficientaccuracy. Most queries involve selection conditions or Group-By's. Ifthe selectivity is low, then it dramatically and adversely impacts theaccuracy of sampling-based estimation. Selection queries partition therelation into two sub-relations: tuples that satisfy the condition(relevant sub-relation) and those which do not. Sampling in a uniformmanner results in the number of tuples that are sampled from therelevant sub-relation being proportional to its size. If the number islow, a very small relevant sample results, leading to large error. Thesame is true for Group-By queries which partition the relation intonumerous sub-relations (tuples that belong to specific groups).

[0016] There is also a need for an accurate way to estimate the resultsof queries, such as an aggregation of queries having selectionconditions and Group-By's. There is a need for such a way that does notgreatly increase sample sizes, and retains the benefits that samplingprovides for databases having data skew and low selectivity.

SUMMARY OF THE INVENTION

[0017] Given a query with selection conditions or Group-By's, workloadinformation is used to tune a sample by weighting tuples in the sample.Groups that are accessed more often in the workload are sampled at ahigher rate or probability. With each tuple that is included in thesample, the probability is stored. Each aggregate computed over thetuple is then multiplied by the inverse of the probability to normalizeit. Using this weighted sampling causes the sample to contain moretuples that are pertinent to the query, and fewer sampled tuples to bediscarded due to not meeting selection criteria where the accesspatterns of the workload are local, and the workload is a goodrepresentation of the actual queries that will be posed in the future.

[0018] Outlier indexes may also be selected using information from theworkload for queries with selection conditions or Group-By's. A querywith selection or Group-By implicitly defines one or more sub-relationsover which aggregates are computed. Workload information is used toweight each sub-relation. The variance in the aggregate column for eachsub-relation is calculated. An optional pruning retains onlysub-relations whose weighted variance is greater than a threshold.Memory is then allocated among all remaining sub-relations in proportionto their weighted variances. An outlier index is then built for eachsub-relation. The union of the outlier indexes provides the outlierindex for the given query. Queries are then estimated using both theweighted outlier index and the weighted sample.

BRIEF DESCRIPTION OF THE DRAWINGS

[0019]FIG. 1 is a block diagram of a computer system on which thepresent invention may be implemented.

[0020]FIG. 2 is a flowchart of a sampling process taking outliers intoaccount.

[0021]FIG. 3 is a flowchart of an outlier identification algorithm.

[0022]FIG. 4 is a graphic representation of windows used in thealgorithm of FIG. 3.

[0023]FIG. 5 is a flowchart of the use workload information in thecontext of weighted sampling.

[0024]FIG. 6 is a flowchart of the collection of weights for multiplesub-relations given a query with low selectivity.

[0025]FIG. 7 is a flowchart of the generation of an outlier index formultiple sub-relations given a query with low selectivity.

DETAILED DESCRIPTION

[0026] In the following detailed description of exemplary embodiments ofthe invention, reference is made to the accompanying drawings that forma part hereof, and in which is shown by way of illustration specificexemplary embodiments in which the invention may be practiced. Theseembodiments are described in sufficient detail to enable those skilledin the art to practice the invention, and it is to be understood thatother embodiments may be utilized and that logical, mechanical,electrical and other changes may be made without departing from thespirit or scope of the present invention. The following detaileddescription is, therefore, not to be taken in a limiting sense, and thescope of the present invention is defined only by the appended claims.

[0027] The detailed description is divided into multiple sections. Afirst section describes the operation of a computer system thatimplements the current invention. This is followed by a description ofthe problems of sampling when the data being sampled exhibits data skewor low selectivity, and a high level description of how aggregatequeries are estimated using a combination of outlier identification andsampling. Further detail regarding the identification of outlier data,the sampling techniques and how data from outliers and sampling arecombined is then described. One further section provides detailregarding the exploitation of workload information for queries withselection and group by. This is followed by a conclusion which describessome potential benefits and describes further alternative embodiments.

[0028] Hardware and Operating Environment

[0029]FIG. 1 provides a brief, general description of a suitablecomputing environment in which the invention may be implemented. Theinvention will hereinafter be described in the general context ofcomputer-executable program modules containing instructions executed bya personal computer (PC). Program modules include routines, programs,objects, components, data structures, etc. that perform particular tasksor implement particular abstract data types. Those skilled in the artwill appreciate that the invention may be practiced with othercomputer-system configurations, including hand-held devices,multiprocessor systems, microprocessor-based programmable consumerelectronics, network PCs, minicomputers, mainframe computers, and thelike which have multimedia capabilities. The invention may also bepracticed in distributed computing environments where tasks areperformed by remote processing devices linked through a communicationsnetwork. In a distributed computing environment, program modules may belocated in both local and remote memory storage devices.

[0030]FIG. 1 shows a general-purpose computing device in the form of aconventional personal computer 20, which includes processing unit 21,system memory 22, and system bus 23 that couples the system memory andother system components to processing unit 21. System bus 23 may be anyof several types, including a memory bus or memory controller, aperipheral bus, and a local bus, and may use any of a variety of busstructures. System memory 22 includes read-only memory (ROM) 24 andrandom-access memory (RAM) 25. A basic input/output system (BIOS) 26,stored in ROM 24, contains the basic routines that transfer informationbetween components of personal computer 20. BIOS 26 also containsstart-up routines for the system. Personal computer 20 further includeshard disk drive 27 for reading from and writing to a hard disk (notshown), magnetic disk drive 28 for reading from and writing to aremovable magnetic disk 29, and optical disk drive 30 for reading fromand writing to a removable optical disk 31 such as a CD-ROM or otheroptical medium. Hard disk drive 27, magnetic disk drive 28, and opticaldisk drive 30 are connected to system bus 23 by a hard-disk driveinterface 32, a magnetic-disk drive interface 33, and an optical-driveinterface 34, respectively. The drives and their associatedcomputer-readable media provide nonvolatile storage of computer-readableinstructions, data structures, program modules and other data forpersonal computer 20. Although the exemplary environment describedherein employs a hard disk, a removable magnetic disk 29 and a removableoptical disk 31, those skilled in the art will appreciate that othertypes of computer-readable media which can store data accessible by acomputer may also be used in the exemplary operating environment. Suchmedia may include magnetic cassettes, flash-memory cards, digitalversatile disks, Bernoulli cartridges, RAMs, ROMs, and the like.

[0031] Program modules may be stored on the hard disk, magnetic disk 29,optical disk 31, ROM 24 and RAM 25. Program modules may includeoperating system 35, one or more application programs 36, other programmodules 37, and program data 38. A user may enter commands andinformation into personal computer 20 through input devices such as akeyboard 40 and a pointing device 42. Other input devices (not shown)may include a microphone, joystick, game pad, satellite dish, scanner,or the like. These and other input devices are often connected to theprocessing unit 21 through a serial-port interface 46 coupled to systembus 23; but they may be connected through other interfaces not shown inFIG. 1, such as a parallel port, a game port, or a universal serial bus(USB). A monitor 47 or other display device also connects to system bus23 via an interface such as a video adapter 48. In addition to themonitor, personal computers typically include other peripheral outputdevices (not shown) such as speakers and printers.

[0032] Personal computer 20 may operate in a networked environment usinglogical connections to one or more remote computers such as remotecomputer 49. Remote computer 49 may be another personal computer, aserver, a router, a network PC, a peer device, or other common networknode. It typically includes many or all of the components describedabove in connection with personal computer 20; however, only a storagedevice 50 is illustrated in FIG. 1. The logical connections depicted inFIG. 1 include local-area network (LAN) 51 and a wide-area network (WAN)52. Such networking environments are commonplace in offices,enterprise-wide computer networks, intranets and the Internet.

[0033] When placed in a LAN networking environment, PC 20 connects tolocal network 51 through a network interface or adapter 53. When used ina WAN networking environment such as the Internet, PC 20 typicallyincludes modem 54 or other means for establishing communications overnetwork 52. Modem 54 may be internal or external to PC 20, and connectsto system bus 23 via serial-port interface 46. In a networkedenvironment, program modules, such as those comprising Microsoft® Wordwhich are depicted as residing within 20 or portions thereof may bestored in remote storage device 50. Of course, the network connectionsshown are illustrative, and other means of establishing a communicationslink between the computers may be substituted.

[0034] Software may be designed using many different methods, includingobject oriented programming methods. C++ and Java are two examples ofcommon object oriented computer programming languages that providefunctionality associated with object oriented programming. Objectoriented programming methods provide a means to encapsulate data members(variables) and member functions (methods) that operate on that datainto a single entity called a class. Object oriented programming methodsalso provide a means to create new classes based on existing classes.

[0035] An object is an instance of a class. The data members of anobject are attributes that are stored inside the computer memory, andthe methods are executable computer code that act upon this data, alongwith potentially providing other services. The notion of an object isexploited in the present invention in that certain aspects of theinvention are implemented as objects in one embodiment.

[0036] An interface is a group of related functions that are organizedinto a named unit. Each interface may be uniquely identified by someidentifier. Interfaces have no instantiation, that is, an interface is adefinition only without the executable code needed to implement themethods which are specified by the interface. An object may support aninterface by providing executable code for the methods specified by theinterface. The executable code supplied by the object must comply withthe definitions specified by the interface. The object may also provideadditional methods. Those skilled in the art will recognize thatinterfaces are not limited to use in or by an object orientedprogramming environment.

[0037] Effects of Skew and Low Selectivity

[0038] The effects of skew and low selectivity can adversely affect theaccuracy of sampling-based estimations.

[0039] The following example demonstrates the adverse impact of skew onthe applicability of sampling.

[0040] Example 1. Suppose there are 10,000 tuples of which 99% havevalue 1 in the aggregate column, while the remaining 1% of the tupleshave value 1000. Consider using a uniform random sample of size 100 toestimate the average value of the column over all tuples. It is quitelikely that the sample would not include any tuple of value 1000,leading to an estimate of 1 for the average value. On the other hand, ifperchance two or more tuples of value 1000 were to be included in thesample, then the estimate of the average value would be more than 20.98.In either case, the estimate would be far from the true value of theaverage which is 10.99. It is only in the case where there is exactlyone tuple of value 1000 in the sample that we would obtain a reasonableestimate of the average value. But the latter event has probability only0.37. Therefore, with high probability a large error in the estimatewould result.

[0041] In this example it is observed that a skewed database ischaracterized by the existence of certain tuples that are deviant fromthe rest with respect to their contribution to the aggregate value.These tuples are referred to as “outliers.” The presence of outlierscauses a large “variation” in the aggregate values which is directlyproportional to the standard error in sampling as demonstrated by thefollowing theorem. The theorem quantifies the error for estimatingaggregate queries that calculate sum or average of values defined overthe tuples. The value associated with a tuple could be a function of oneor more of its attributes (columns). For example, if the query is selectsum (quantity*price) from line item then the function ƒ(t) over thetuples is ƒ(t)=t.quantity*t.price.

[0042] Theorem 1. Consider a relation of size N and let {y₁, y₂, . . . ,y_(N)} be the set of values associated with the tuples in the relation.Let U be a uniform random sample of the y_(i)'s of size n. Then thesample mean {overscore (y)}=1/nΣ_(y) _(i) _(εU)y_(i) is an unbiasedestimator of the actual mean {overscore (Y)}=1/NΣ_(i=1) ^(N)y₁ with astandard error of $\begin{matrix}{{\in {= {\frac{S}{\sqrt{n}}\sqrt{1 - \frac{n}{N}}}}}{where}{S = \sqrt{\frac{\sum\limits_{i = 1}^{N}\left( {y_{i} - \overset{\_}{Y}} \right)^{2}}{N - 1}}}} & (1)\end{matrix}$

[0043] If there are outliers in the data then the quantity S is verylarge, in which case the sample size n should be increased to meet theerror bounds. For instance, consider two sets of values each containingN elements. Let S and 2S be the variance of the two sets. If a samplesize of n is sufficient to ensure error bounds for the first set, asample size 4n is required to provide the same guarantees for the secondset. In general, the sample size would have to increase quadraticallywith the variance thereby hampering the efficiency gained from sampling.Earlier works assume that the aggregate attributes are not skewed.

[0044] Since most queries involve selection conditions and/orGroup-By's, it is important to study their interaction with sampling. Ifthe selectivity of a query is low, then it dramatically and adverselyimpacts the accuracy of sampling-based estimation. A selection querypartitions the relation into two sub-relations: tuples that satisfy thecondition (relevant sub-relation) and those which do not. With samplinguniformly from the relation, the number of tuples that are sampled fromthe relevant sub-relation is proportional to its size. If this number issmall (low selectivity), very small relevant sample size is obtained,leading to large error. Consider the standard error (ε) in uniformsampling as given by Theorem (1). If n is the sample size for the entirerelation and q is the selectivity of the query, then the expected numberof samples that satisfy the selection condition is given by qn. Sincethe error is inversely proportional to {square root}{square root over(qn)}, a low selectivity q will give rise to large error due tosampling. The same is true for Group-By queries which partition therelation into numerous sub-relations (tuples that belong to a specificgroup). The main point to note is that for uniform sampling to performwell, the importance of a sub-relation should be reflected by its size,which is not the case in general.

[0045] Use of Outlier Indexes to Approximate Aggregate Queries

[0046] As indicated above, datasets may contain outlier values, whichare values that occur with low frequency, but have a high contributionto an aggregate of the data being computed. Uniform random sampling isnot sufficient to provide an accurate estimate of the aggregate when thedata being sampled contains such outliers. An improved techniqueaccounts for such outliers in combination with sampling of the data. Ahigh level simplified flowchart of the steps involved is shown in FIG.2. The functions described in this and other flowcharts may berepresented in a computer program stored on a computer readable mediumand performed on the computer system of FIG. 1. The computer readablemedium may also comprise carrier waves for transmitting the programbetween computers and various storage medium.

[0047] In FIG. 2, processing starts at 210. A query is received from auser at 210 and outliers are identified from the data to which the queryis directed at 220, and as further described with respect to FIGS. 3 and4. The outliers identified are then indexed at 225. It should be notedthat the identification of outliers and indexing of them is normallyperformed prior to receipt of queries, but may also be performed priorto the query being received. This greatly speeds up the processing ofthe query, which in many cases is real time, with a user waiting forresults.

[0048] Following creation of the index of outliers at 225, the outliersare then subjected to the query and aggregated at 230. This provides anoutlier aggregate, such as a sum of all the outliers meeting the querycriteria. A sample is obtained at 232. It may be obtained in manydifferent ways, such as during a preprocessing phase prior to executionof the query. Next, the outliers are removed from the data at 235. Suchdata is then sampled, subjected to the query, and aggregated. The resultof this is extrapolated based on the frequency of the sampling and thetotal number of non-outliers to provide an estimated aggregate. Theestimated sum is combined with the outlier aggregate to provide anestimate of the aggregate of the entire dataset at 240, and control isreturned at 250.

[0049] The overall process of FIG. 2 is now described in further detail.Given a query Q which aggregates over column C of relation R, anexisting “outlier index” for C is used along with a uniform sample of Rto approximately answer Q. An “outlier index” R_(O) may be thought of asa sub-relation of the original relation R.

[0050] The table R is partitioned into two subtables R_(O) (outliers)and R_(NO) (non-outliers), i.e., R_(O) is the set of tuples that belongto the outlier index. An aggregation query Q can be considered as the“union” of two sub-queries, the first of which is Q applied to theR_(O), while the second is Q applied to the R_(NO). This leads to thefollowing scheme for approximately answering an aggregate query for agiven choice of the outliers R_(O). To illustrate the scheme, theexample query—select sum(sales) from lineitem is used.

[0051] 1. Determine Outliers—specify the sub-relation R_(O) of thetuples deemed to be the outliers and create an index (the outlier index)for these tuples. In the example, a view called lineitem_outlier iscreated, which will be appropriately indexed.

[0052] 2. Sample Non-Outliers—select a uniform random sample S′ of therelation R. Remove any tuples from R_(O) that are in S′ to obtain asample S of R_(NO). In the example, the lineitem table is sampled, anytuples from the sample that are part of lineitem_outlier are removed,and the remaining tuples in the sample are materialized in a sampletable called lineitem_(— samp.)

[0053] 3. Aggregate Outliers—apply the query to the outliers in R_(O)accessed via the outlier index. In the example, this corresponds tocomputing sum(sales) for the view lineitem_outlier.

[0054] 4. Aggregate Non-Outliers—apply the query to the sample S andextrapolate to obtain an estimate of the query result for R_(NO). In theexample, this corresponds to computing sum(sales) for the tablelineitem_samp and then multiplying the result by the inverse of thesampling fraction (extrapolation).

[0055] 5. Combine Aggregates—combine the approximate result for R_(NO)with the exact one for R_(O) to obtain an approximate result for R. Inthe example, this means adding sum(sales) for the view lineitem_outlierto the extrapolated sum(sales) for lineitem_samp.

[0056] It should be noted that 1 and 2 above are preprocessing steps andare executed once, while the remaining steps are online and are executedfor each query. In 5, the approximate result for R_(NO) is combined withthe result for R_(O) to obtain an approximate result for R. Thisrequires that the aggregate function have the following property: Let ƒbe a function defined over a multiset of values. If ƒ(X) and ƒ(Y) arethe function values over the multisets X and Y respectively, then ƒ(X∪Y)may be calculated without “looking” at the multiset (X∪Y), using onlyƒ(X), ƒ(Y), and possibly some simple statistics like |X| and |Y| overthe multisets. Aggregate functions like sum and avg have this property.Although the basic framework of outlier indexing can be applied to anyfunction ƒthat satisfies the above constraint, identifying thesub-relation for outlier indexes is dependent on the specifics of ƒ.

[0057] A simplified overview of the manner in which outlier values areselected is shown in FIG. 3. Starting at 310, a threshold, L, for thenumber of tuples or rows of a dataset are determined. This threshold maybe determined based on the availability of system resources that can bededicated to improving the accuracy of an estimated aggregation, and canalso be based on a desired accuracy of the estimation. The threshold isbasically the number of outlier values which can be taken into account,and may be set by a database system administrator. Further detail isprovided in the alternative embodiments section.

[0058] Given the threshold, values in a column being aggregated aresorted at 320. L corresponds to the number of values, which may be equalto the number of tuples in the database if there are no selectionconditions. The values are sorted in ascending or descending order usingany available sorting algorithm. Once the values have been sorted, anindex of outlier values is created. A sliding window is employed toidentify sets of outliers. In other words, the window is of size N-L,and is slid along the column of values, starting at the first value inblock 330. At this point, the first N-L values are situated within thewindow, and the last L values are outside of the window. A variance ofsuch values is calculated at 340. The window is then moved down onevalue in the sorted list, such that the first value is no longer in thewindow. Since the window is a constant size in one embodiment, L-1values at the high end of the list are now excluded from the window. Asthe window is slid one value at a time, the variance of the valueswithin each window of size N-L is determined at 350. The variance isdetermined in one embodiment by maintaining it for a dynamic set ofnumbers subject to insertions and deletions. Moving the window one valueresults in both an insertion and deletion rather than a recalculation ofthe variance using all the numbers in the set or window.

[0059] The index is created by identifying the window having the lowestvariance, and including all values outside of the window in the outlierindex at 360. Several variations of the above are possible, includingsliding the window more than one value at a time. In one embodiment, thewindow moves multiple values between calculations of the variance. Inthis embodiment, once the low variance window is determined, the windowmay be repositioned one value at a time about such low variance windowand the variance at each window calculated. Again the window with thelow variance is selected and outliers thus identified. In still furtherembodiments, it is recognized that once a lowest variance isencountered, and the variance keeps increasing, one may stop sliding thewindow and select the window with the lowest variance alreadycalculated.

[0060] A simple example of selecting the outlier values and finding theestimated aggregate is described with respect to FIG. 4. Twelve valuesare shown, corresponding to twelve tuples in a database. The values areshown four times, in columns 420, 425, 430 and 435. A window is alsoshown in each column, identifying nine entries in each column. Thewindows are slid down each column, such that the first window 440includes the first nine values. The second window 445 covers values twothrough 10, the third window covers values three through eleven, and thefourth window covers values four through twelve. Each window is shownwith a variance corresponding to the variance of the values in thewindow. The variance of column 425 is lowest, and defines the values tobe sampled. In this example, every third value is included in thesample, and there is memory allocated to three outliers.

[0061] Further, in this example, the sum of the column of values is thedesired aggregate. The actual sum is easily seen to be two hundredtwenty two. To estimate the sum in accordance with the invention, thesum of the outliers, corresponding to values one, forty, and eighty isone hundred twenty one. Every third sample of the values in the window445, include the values ten, fourteen and fifteen. The sum of thesesamples must be multiplied by the inverse of the sample rate, or three,yielding a sum of one hundred seventeen, for a total estimated sum oftwo hundred thirty eight, as shown at 465. Estimated sums for the otherwindow positions are also shown at 460, 470 and 480. It should be notedthat this example is only for illustration of the operation of theinvention, and is not actually a good candidate for estimation, sincethe total number of values is extremely small. In fact, column 460provides a slightly better estimate, even though the variance of thedata in the window was slightly higher than that of window 445. However,the variance of windows 450 and 455 was significantly higher, as werethe estimated sums using such windows.

[0062] Detail of Identifying Outlier Values

[0063] Further detail is now provided with respect to identifying theoutlier values and constructing outlier indexes.

[0064] This section describes further detail in selection of outlierswhere the aggregation is a sum. Since the relative error in estimatingthe average is the same as that for sum, the choice of outliers does notchange when the aggregation is average.

[0065] There are two points to be observed: (1) the query error is dueto the error in estimating the aggregate of the non-outliers from theirsample and (2) there is an extra cost (above that incurred by puresampling) due to the need for maintaining and accessing the outlierindex. Therefore, outlier set R_(O) is selected to minimize theapproximation error for a class of queries subject to the constraintthat R_(O) contains at most τ tuples from the relation R. The followingdefinition identifies an optimal choice of the outlier set R_(O).

[0066] Definition 1. For any sub-relation R′⊂R, let ε(R′) be thestandard error in estimating Σ_(tεR\R′)t.C using random samplingfollowed by extrapolation. An optimal outlier set R_(O)(R, C, τ) for acolumn C in a relation R with a threshold τ is defined as a sub-relationR_(O)⊂R such that

[0067] |R_(O)|≦τ, and

[0068] ε(R_(O))=min_(R′⊂R∥R′|≦τ){ε(R′)}.

[0069] Essentially the outlier index is defined as an optimalsub-relation R_(O) that leads to the minimum possible sampling error,subject to the constraint that R_(O) has at most τ tuples in therelation R. The sampling error is the error in estimating the aggregatevalue over the tuples not included in R_(O) using the standardsample-and-extrapolate strategy. The error is directly proportional tothe variance S given by$S = {\sqrt{\frac{\sum\limits_{i = 1}^{N}\left( {y_{i} - \overset{\_}{Y}} \right)^{2}}{N - 1}}.}$

[0070] Let S(R′) be the variance for any sub-relation R′⊂R. Then anoutlier index R_(O) is a sub-relation of size at most τ such that thecomplement R\R_(O) sub-relation has minimum variance S(R\R_(O)). Thefollowing theorem assists in choosing such a sub-relation efficiently.

[0071] Theorem 2. For any multiset A=(y₁, y₂, . . . , y_(n)} let thevariance S(A) be defined as$S = \sqrt{\frac{\sum\limits_{i = 1}^{N}\left( {y_{1} - \overset{\_}{Y}} \right)^{2}}{N - 1}}$

[0072] where {overscore (Y)}=1/NΣ_(i=1) ^(N)y_(i) is the mean of themultiset. Consider a multiset R={y₁, y₂, . . . , Y_(N)} where they_(i)'s are in sorted order. Let R_(O)⊂R be the subset such that

[0073] |R_(O)|≦τ, and

[0074] S(R\R_(O))=min_(R′⊂R∥R′|≦τ){S(R\R′)}.

[0075] then R_(O)={Y₁, Y₂, . . . , y_(τ′), y_(N−τ+τ′+1), . . . , y_(N)}for some 0≦τ′≦τ.

[0076] The theorem states that the subset that minimizes the varianceover the remaining set consists of the leftmost τ′ elements (for some0≦τ′≦τ) and the rightmost τ−τ′ elements from the original multiset, whenthe elements are arranged in a sorted order. Thus, the selection of anoutlier index simplifies to finding the value τ′. This gives rise to thefollowing algorithm for outlier index selection, which follows theflowchart of FIG. 3.

[0077] Algorithm Outlier-Index (R, C, τ)

[0078] 1. Read the values in column C of the relation R. Let y₁, y₂, . .. , y_(N) be the sorted order of the values appearing in column C. Eachvalue corresponds to a tuple (320).

[0079] 2. For i=1 to τ, compute E(i)=S({y_(i), y_(i+1), . . . ,y_(N−τ+i−1)}), where S is the variance of the multiset as defined inTheorem 2 (350).

[0080] 3. Let τ′ be the index where E(i) takes its minimum value. Then,the outlier index is the tuples that correspond to the set of values{y₁, y₂, . . . , y_(96 ′−1), y_(N−τ+τ′), y_(N−τ+τ′), . . . , y_(N)}(360).

[0081] The efficiency of the algorithm crucially depends on the abilityto compute variance efficiently. We show how this can be done in lineartime, once the values in column C have been sorted. It is well knownthat quantities such as sum, mean, variance, etc. can be efficientlymaintained for a dynamic set of numbers subject to insertions anddeletions. In particular, each insertion or deletion can be handled inO(l) time. In implementing Step 2, we start by scanning the data tocompute E(l). We also make memory-resident copies of the first τ and thelast τ values of (y₁, y₂, . . . , Y_(N)}. (If they cannot fit intomemory we make copies on disk, however τ is usually small). After thiswe do not have to scan the data any more. This is because E(i+1) can beincrementally computed from E(i) since the value to be deleted (y_(i))and the value to be inserted (y_(N−τ+i)) can be looked up from thecopies. Thus, the running time of the algorithm is dominated by thesorting in Step 1.

[0082] Multiple aggregates may be handled by maintaining an outlierindex for each aggregate. However, outlier indexes for differentaggregates are likely correlated, and as a result have many tuples incommon (a tuple that is outlier for a certain column or aggregate mayalso be an outlier for other aggregates). Hence, an optimizationheuristic is to maintain the union of these indexes. A more preciseanalysis partitions the available memory space among outlier indexes formultiple aggregates in a judicious manner that takes into account theirrelative importance, e.g., by exploiting workload information.

[0083] Given sufficient space to store T tuples, the allocation ofmemory between samples and outlier index should be done in a manner thatminimizes error.

[0084] Let S(n) denote the variance in the non-outliers for an optimaloutlier index of size n. If we allocate the space such that we have ntuples in the outlier index and T−n tuples in the sample, then the erroras given by equation 1 is proportional to S(n)/{square root}{square rootover (T−n)}. Since the function S(n) does not have a closed-form formulaand depends on the data distribution, identifying an optimal allocationwould require evaluating S(n) for a set of evenly-spaced values of nfrom 0 to T; finding the value of n (n_(o)) for which S(n)/{squareroot}{square root over (T−n)} is minimized and distributing the memoryas n_(o) tuples in the outlier index and T−n_(o) tuples in the sample.

[0085] Extensions to Other Aggregates. We have described a procedure forselecting outliers for the case of sum and avg. Suppose we want toaggregate (sum or average of) ƒ(t), where ƒ is a real valued functiondefined over the tuples. Then, we can use the function values instead ofy_(i)'s in our algorithm and can determine the sub-relation thatminimizes the variance over the remaining set. One should note thatƒ(t)=1 for the count aggregate and in this case there is no variance inthe values. In other words, outlier indexing is not required for countaggregates and uniform sampling works well. It is easy to see that ifthe function is linear in y_(i)'s (i.e., ƒ(t)=ay_(i)+b), then the sameset of tuples that minimize the variance for y_(i)'s minimize thevariance for ƒ(t)'s and hence the outlier index defined for y_(i)'s canbe used. It is well known that uniform sampling works well forestimating the median. On the other hand sampling does not work well formin (and max) queries since intuitively, the min value may be muchsmaller than the next smallest value, and therefore our estimate of mincan be arbitrarily bad unless we happened to obtain the min value in thesample itself. However, we can use sampling to return a value that isnot too far away from the min or max in terms of the rank ordering,i.e., we can use sampling to return a value which, while not theminimum, would be guaranteed to be among the k smallest values for somesmall k; of course, the kth smallest value may be significantlydifferent from the smallest value.

[0086] Exploiting Workload Information for Sampling

[0087] The sliding window approach was described above for use inestimating aggregation queries where data skew existed. In this section,workload information is leveraged for estimating aggregation querieshaving low selectivity. The workload information is used both inobtaining the sample, and in creating the outlier index. By tuning thesample and outlier index to a representative workload (i.e., set ofqueries), the queries can be estimated more effectively.

[0088] The use of workload information for sampling and outlier indexinginvolves the following steps as shown in the flowchart of FIG. 5:

[0089] 1. Workload Collection 510: Obtain a workload representative ofthe queries posed against the database. Modern database systems providetools to log queries posed against the server (e.g., the Profilercomponent of Microsoft SQL Server).

[0090] 2. Trace Query Patterns 520: The workload can be analyzed toobtain parsed information, e.g., the set of selection conditions thatare posed.

[0091] 3. Trace Tuple Usage 530: The execution of the workload revealsadditional information on usage of specific tuples, e.g., frequency ofaccess to each tuple, the number of queries in the workload for which itpasses the selection condition of the query. Since tracking thisinformation at the level of tuples can be expensive, it can be kept atcoarser granularity, e.g., on a page-level.

[0092] Alternatively, batching of updates can be used to lower thisoverhead.

[0093] 4. Weighted Sampling 540: Perform sampling by taking into accountweights of the tuples (from 530).

[0094] 5. Weighted Outlier Indexing 550: Identify an outlier index thatcaptures outliers in popular queries using query pattern analysis (from540).

[0095] In principle, the precomputed uniform sample for the entirerelation can be used to derive a sample for each sub-relation created bya selection condition or a group-by. However, unlike the case whereaggregation is done over the entire relation in which there was exactcontrol on the sample size, in the current case direct sampling is notdone from the sub-relation. Instead, sampling is done from the entirerelation and then the selection filter is applied to the sampled tuples.If q is the selectivity, for a sample of size n, the effective samplefor the sub-relation would be of expected size nq but may differ fromthis expectation. To guarantee that with high probability at least nqtuples are sampled from the sub-relation, slightly more than n tuplesare sampled from the relation. In the following discussion this notionof “slightly” is quantified. Note that the discussion is valid for thecase of group-by queries too, where the selectivity for a group isdefined as the ratio N_(g)/N, where N_(g) is the number of tuples thatbelong to the group and N is the number of tuples in the relation.

[0096] Consider a group that has very few tuples, say N_(g). Supposeeach tuple with probability p is sampled, leading to an expected samplesize of N_(p). From Chernoff bounds, it is guaranteed that the actualnumber of tuples that are sampled from the group is at least (1−δ)N_(g)p(the expected value is N_(g)p) with confidence at least 1−γ from some0<γ, δ<1, provided N_(g)p is sufficiently large. Since the samplingfraction p is fixed, this requires that N_(g) be large. Therefore, itcannot be guaranteed that the small groups are represented sufficientlyin the sample. Hence, only guarantees for the big groups are provided.Workload information can also be leveraged for the case of small groupsas described below. If the selection (similar results are valid forgroup-bys) condition is independent of the aggregate column and if theselectivity is not too small, then uniform sampling works well,otherwise weighted sampling is used.

[0097] Exploiting Weighted Samples

[0098] With weighted sampling, sampling is performed more from groupswhich though small in size, are important, i.e., have high usage. Usageof a database is typically characterized by considerable locality in theaccess pattern. Queries against the database access certain parts of thedata more than others. Normalizing the aggregate when weights areassociated with tuples is more complex since there is no single samplingfraction. With each tuple that is included in the sample 610 in FIG. 6,the probability with which it was accepted in the sample is associatedor stored at 620. The inverse of this probability is the multiplicationfactor associated with the tuple. This essentially says that if a tuplewas included with probability p then it is a representative of 1/psimilar tuples. Each aggregate computed over this tuple is multiplied bythis multiplication factor at 630. In the (degenerate) case of uniformsampling since the probability is the same for each tuple themultiplication factor is the same for all the tuples.

[0099] Weighted sampling works well if (a) the access pattern of queriesis local (most of the queries access a small part of the relation) and(b) the workload is a good representation of the actual queries whichwill be asked in future. To quantify the benefit of workload-drivenweighted sampling, let G be the set of tuples accessed by a query Q. Ifthe query Q is such that the average weight of a tuple in G issignificantly higher than the overall average weight, then the effectivesampling fraction is much higher for this query as compared to theoverall sampling rate. This is quantified in the following lemma.

[0100] Consider a relation R (fact table). Let Q denote the subset oftuples from R that are accessed (satisfy the selection condition) bysome query. Let W_(Q) denote the average weight of a tuple in Q, and wdenote the average weight of all tuples in R. Using weighted samplingwith an overall sampling fraction p, the effective sampling fraction forthe query equals pw_(Q)/w.

[0101] Thus, weighted sampling gives better accuracy for queries whichaccess tuples that have higher weights (high w_(Q)/w ratio). As aresult, better accuracy is obtained through effectively doing moresampling although the execution time remains the same since the samplesize for the fact table is the same as in the case of uniform sampling.In other words the sample contains more tuples that are pertinent to thequery and less of the sampled tuples are thrown away (do not pass theselection condition).

[0102] Selecting Outlier Indexes According to Workload

[0103] A query with selection or group-by implicitly defines one or moresub-relations over which aggregates are computed. The set of outliersfor the sub-relation produced by a selection condition may be completelydifferent from the outliers for the entire relation (global outliers).In fact, they may not even belong to the set of global outliers.Ideally, in such situations, we would like to have an outlier index foreach possible sub-relation in order to answer aggregate queriespertaining to it, but that would be prohibitively expensive.

[0104] Consider the case where the selection condition (or group-by) isindependent of the column that is being aggregated. In that case, onewould expect that tuples that are outliers in the sub-relation(s)created by the selection or the group-by are outliers for the entirerelation too and are a part of the global outlier index. Therefore,applying the selection condition to the global outlier index will yieldan outlier index for the sub-relation. However, the independenceassumption need not be valid in many situations. In fact, theindependence assumption is less likely to be valid whenever thesub-relation size is small, e.g., when a condition or a group has lowselectivity, or a range query on the aggregate column itself. We outlinean approach that exploits workload information to define an outlierindex that can cope in the absence of the independence assumption.

[0105] If the independence assumption does not hold, we are left with nochoice but to index the outliers for each sub-relation and consolidateeverything into a single global index. Since we cannot afford to makethe outlier index too big, we have to ration the available space betweendifferent indexes corresponding to sub-relations. Let {R₁, R₂, . . . ,R_(q)} be the (implicit) sub-relations corresponding to the differentselection conditions and group-by's. If there is no selection orgroup-by, then the corresponding sub-relation is the entire relation.Let {w₁, w₂, . . . , w_(q)} be the weights (number of queries thatqueried the sub-relation) associated with these sub-relations for thegiven workload. Let T be the total available memory for building anoutlier index. The problem is to ration T among the sub-relations forbuilding indexes.

[0106] In this section we see how we can do this allocation amongdifferent sub-relations efficiently. It should be noted that the schemethat we present is based on heuristics.

[0107] To allocate the memory among the different sub-relations wepropose the following steps as shown in FIG. 7:

[0108] 1. Collect information about the different sub-relations {R₁, R₂,. . . , R_(q)} and their weights {w_(l), w₂, . . . , w_(q)} using theworkload (710).

[0109] 2. For each sub-relation R_(i) calculate the variance S(R_(i)) inthe aggregate column for that sub-relation. The error corresponding tothis sub-relation is proportional to S(R_(i)) (720).

[0110] 3. Retain only those sub-relations for which the weightedvariance w_(i)S(R_(i)) is greater than a certain user-specifiedthreshold. This step prunes out or eliminates those sub-relations forwhich the data distribution does not warrant an outlier index or thosethat are not queried often. In other words, they have a weightedvariance below the threshold (730).

[0111] 4. Allocate the given memory among all remaining sub-relations inproportion to their weighted variances (w_(i)S(R_(i))) (740). Using thismemory bound, build an outlier index for each sub-relation (750) andtake the union of the outlier indexes over all sub-relations for whichoutlier indexes were built (760).

[0112] Several improvements to the basic scheme presented above arepossible. For example, under certain circumstances it is possible toexploit overlap among sub-relations. However, the basic scheme suffersfrom the complication that the available memory allocated to outlierindexes may not be fully exploited. This would require having to repeatSteps 3 and 4 with lower user-specified threshold. In an alternativescheme, where it is possible to globally order the outliers, from themost significant to the least, and index them in this order until theavailable memory is fully used up.

[0113] Finally, we note that to build a robust system, we need amechanism to decide for a given query whether our samples and outlierindexes are good enough (i.e., give acceptable error) for answering it.We suggest the following mechanism: We use the sample of thenon-outliers to get an estimate of the variance in the aggregate valuesover the non-outliers in the sub-relation that is queried. Using thisestimate and the number of samples of the non-outliers we can estimatethe error using equation in Theorem 1. This error estimate can be usedto decide, perhaps with user input, whether we should use thecombination of samples and outlier index to answer the query or if weshould answer it exactly using the data itself.

[0114] The aggregation of the outlier index is then combined with thesample to arrive at the overall estimate of the query as above.

[0115] Conclusion

[0116] Estimating aggregation queries is done by first identifyingoutlier values, aggregating the outlier values, and then sampling theremaining data after pruning the outlier values. The sampled data isextrapolated and added to the aggregated outlier values to provide anestimate for each aggregation query.

[0117] The use of outlier values combined with sampling provides theability to use sampling-based estimations over data having skew inaggregate values, low selectivity or ad-hoc queries and where there aresmall groups in group-by queries with better accuracy, and withoutlosing the benefits of sampling in reducing system resourcerequirements.

[0118] For queries having low selectivity or group by, weighted samplingis used, and may be combined with an outlier index based on weights ofsub-relations. Sampling is performed on the entire relation, and thenthe aggregates are multiplied by the inverse of the probabilitiesassociated with each sub-relation. The variance of each sub-relation isused to prune sub-relations, and the remaining sub-relations are used tocalculate an outlier index for each one. The outlier indexes are thensubjected to a union, to form a single outlier index. The aggregate fromthis single outlier index is then combined with the aggregate from theweighted sample to provide a final estimate of the aggregation.

[0119] This application is intended to cover any adaptations orvariations of the present invention. It is manifestly intended that thisinvention be limited only by the claims and equivalents thereof SQL wasused to query the data and create indexes and sampling. Other forms ofrelational database query language and various programming languagescould be used as known to those of skill in the art.

We claim:
 1. A method of estimating the results of a database query, themethod comprising: collecting workload information related to thedatabase; tracing query patterns of queries in the workload to identifythe usage of tuples in the database during execution of the queries;determining sample weights based on tuple usage; and performing aweighted sampling of the database based upon the sample weights.
 2. Themethod of claim 1 wherein the weighting sampling is based on aprobability of usage of tuples required in executing the workload. 3.The method of claim 2 and further comprising computing an aggregate overvalues in each sampled tuple.
 4. The method of claim 3 wherein theaggregate is computed by multiplying each value by the inverse of theprobability with which corresponding tuples were sampled.
 5. The methodof claim 1 wherein the weights are a function of the frequency of accessof a tuple and the number of queries in the workload that access thetuple.
 6. The method of claim 1 wherein the tuple usage is stored on apage level.
 7. A machine readable medium having instructions for causinga machine to perform a method of estimating the results of a databasequery, the method comprising: collecting workload information related tothe database; tracing query patterns of queries in the workload toidentify the usage of tuples in the database during execution of thequeries; determining sample weights based on tuple usage; and performinga weighted sampling based upon tuple usage.
 8. The machine readablemedium of claim 7 wherein the weights are a function of the frequency ofaccess of a tuple and the number of queries in the workload that accessthe tuple.
 9. The method of claim 7 wherein the tuple usage is stored ona page level.
 10. A system that estimates the results of a databasequery, the method comprising: a module that collects workloadinformation related to the database; a module that traces query patternsof queries in the workload to identify the usage of tuples in thedatabase during execution of the queries; a module that determinessample weights based on tuple usage; and a module that performs aweighted sampling of the database based upon the sample weights.
 11. Amethod of estimating the results of a database query, the methodcomprising: performing a weighted sampling of tuples in the databasebased on a probability of usage of tuples required in executing a givenworkload; storing the probability for each tuple sampled; computing anaggregate over values in each sampled tuple while multiplying by theinverse of the probability with which each tuple was sampled.
 12. Themethod of claim 11 wherein the weights are a function of the frequencyof access of a tuple and the number of queries in the workload thataccess the tuple.
 13. The method of claim 11 wherein the tuple usage isstored on a page level.
 14. A system that estimates the results of adatabase query, the system comprising: means for performing a weightedsampling of tuples in the database based on a probability of usage oftuples required in executing a given workload; means for storing theprobability for each tuple sampled; means for computing an aggregateover values in each sampled tuple while multiplying by the inverse ofthe probability with which each tuple was sampled.
 15. A machinereadable medium having instructions for causing a machine to perform amethod of estimating the results of a database query, the methodcomprising: performing a weighted sampling of tuples in the databasebased on a probability of usage of tuples required in executing a givenworkload; storing the probability for each tuple sampled; computing anaggregate over values in each sampled tuple while multiplying by theinverse of the probability with which each tuple was sampled.
 16. Amethod of sampling tuples in a database to estimate the answer of anaggregation query, the method comprising: determining which tuples areaccessed more often during execution of a workload; sampling the tuples;and calculating an aggregate of values in the sampled tuples as afunction of which tuples are accessed more often.
 17. A method ofgenerating an outlier index for a database and a given workload whereinthe queries in the workload may have selection or group by conditions,the method comprising: identifying sub-relations of tuples in thedatabase induced by selection and group by conditions in queries in theworkload; generating a variance for values in each sub-relation;selecting sub-relations having higher variances; and generating outliersfrom such sub-relations having higher variances.
 18. The method of claim17 and further comprising taking a union of outliers generated from suchsub-relations.
 19. The method of claim 17 wherein sub-relations areselected having a variance higher than a desired threshold.
 20. Amachine readable medium having instructions for causing a machine toperform a method of generating an outlier index for a database and agiven workload wherein the queries in the workload may have selection orgroup by conditions, the method comprising: identifying sub-relations oftuples in the database induced by selection and group by conditions inqueries in the workload; generating a variance for values in eachsub-relation; selecting sub-relations having higher variances; andgenerating outliers from such sub-relations having higher variances. 21.A system that generating an outlier index for a database and a givenworkload wherein the queries in the workload may have selection or groupby conditions, the method comprising: a module that identifiessub-relations of tuples in the database induced by selection and groupby conditions in queries in the workload; a module that generates avariance for values in each sub-relation; a module that selectssub-relations having higher variances; and a module that generatesoutliers from such sub-relations having higher variances.
 22. A methodof generating an outlier index for a database and a given workloadwherein the queries in the workload may have aggregation and selectionor group by conditions, the system comprising: identifying sub-relationsof tuples; generating weights for each sub-relation based on workloadinformation; generating a weighted variance for values in an aggregationcolumn in each sub-relation; allocating memory to sub-relations inproportion to their weighted variances; and generating outliers fromsuch sub-relations based on allocated memory.
 23. The method of claim 22and further comprising a module that takes a union of outliers generatedfrom such sub-relations.
 24. The system of claim 22 whereinsub-relations are selected having a weighted variance higher than adesired threshold.
 25. The method of claim 22 wherein the weights are afunction of a number of queries in the workload that reference thesub-relation.
 26. A method of generating an outlier index for a databaseand a given workload wherein the queries in the workload may haveselection or group by conditions, the method comprising: identifyingsub-relations of tuples having values to be aggregated; generatingweights for sub-relations based on workload information; generating aweighted variance for values in sub-relations; selecting sub-relationshaving higher weighted variances; and generating outliers from suchsub-relations having higher weighted variances.
 27. The method of claim26 and further comprising taking a union of outliers generated from suchsub-relations.
 28. The method of claim 26 wherein sub-relations areselected having a weighted variance higher than a desired threshold. 29.The method of claim 26 wherein the weights are a function of a number ofqueries in the workload that reference the sub-relation.
 30. A machinereadable medium having instructions for causing a machine to perform amethod of generating an outlier index for a database and a givenworkload wherein the queries in the workload may have selection or groupby conditions, the method comprising: identifying sub-relations oftuples having values to be aggregated; generating weights forsub-relations based on workload information; generating a weightedvariance for values in the sub-relations; selecting sub-relations havinghigher weighted variances; and generating outliers from suchsub-relations having higher weighted variances.
 31. A system thatgenerates an outlier index for a database and a given workload whereinthe queries in the workload may have selection or group by conditions,the system comprising: means for identifying sub-relations of tupleshaving values to be aggregated; means for generating weights for eachsub-relation based on workload information; means for generating aweighted variance for values in each sub-relation; means for selectingsub-relations having higher weighted variances; and means for generatingoutliers from such sub-relations having higher weighted variances.
 32. Amethod of estimating the results of a database query over a relation,the method comprising: defining weights of sub-relations using workloadinformation; calculating a weighted variance for each sub-relation;allocating memory to sub-relations in proportion to respective weightedvariances; building an outlier index for each sub-relation in accordancewith allocated memory; and taking the union of outlier indexes.
 33. Themethod of claim 32 and further comprising retaining sub-relations havinga weighted variance over a desired threshold prior to allocating memoryto sub-relations.
 34. The method of claim 32 wherein each weightedvariance is a function of the weight of a sub-relation and the varianceof the sub-relation.
 35. A method of estimating the results of adatabase and a given workload wherein the queries in the workload mayhave selection conditions, the method comprising: collecting workloadinformation related to the database; tracing query patterns of queriesin the workload to identify the usage of tuples in the database duringexecution of the queries; determining sample weights based on tupleusage; performing a weighted sampling of the database based upon thesample weights; and generating a weighted outlier index.
 36. The methodof claim 35 and further comprising calculating an aggregate based on thesamples and the index.
 37. A method of estimating an aggregate result ofa database and a given workload wherein the queries in the workload mayhave selection or group by conditions, the method comprising: collectingworkload information related to the database; tracing query patterns ofqueries in the workload to identify the usage of tuples in the databaseduring execution of the queries; generating sample weights based ontuple usage; performing a weighted sampling based upon tuple usage;generating an aggregate based on the weighted sampling; identifyingsub-relations of tuples in the database induced by selection and groupby conditions in queries in the workload; generating weights for eachsub-relation based on workload information; generating a weightedvariance for values in an aggregation column in each sub-relation;allocating memory to sub-relations in proportion to their weightedvariances; generating outlier indexes from such sub-relations based onallocated memory; performing a union on the outlier indexes to form anoutlier index for the relation; computing an aggregate on the outlierindex for the relation; and combining the aggregate based on theweighted sampling with the aggregate on the outlier index.
 38. A methodof estimating the results of a database and a given workload wherein thequeries in the workload may have selection conditions, the methodcomprising: building an outlier index on outlier values; building asample of non-outlier values; aggregating the outlier values andnon-outlier values and scaling values as required.
 39. The method ofclaim 38 wherein the outlier index is built using workload information.40. The method of claim 38 wherein the non-outlier sample is based onuniform sampling.
 41. The method of claim 38 wherein the non-outliersample is based on weighted workload information.